Initialise

Initialise by loading data.

load("druguse.RData") #Load data

Question 1

Question 1.1

Using ggplot() and geom_bar(), create a plot comparing drug use level within each country, and rename the labels accordingly.

histcountryuselevel <- ggplot(druguse, aes(x = country, fill = UseLevel)) + geom_bar() #Create bar plot (bare column names inside aes(), not druguse$...)
histcountryuselevel + xlab("Country") + ylab("Use Level Count") + labs(fill = 'Use Level') #Assign x, y and legend labels

Replacing x with the gender column of druguse, we create another plot comparing drug use by gender.

histcountrygender <- ggplot(druguse, aes(x = gender, fill = UseLevel)) + geom_bar() #Create bar plot
histcountrygender + xlab("Gender") + ylab("Use Level Count") + labs(fill = 'Use Level') #Assign x, y and legend labels

Question 1.2

Using summary() we can show summary statistics for the data in druguse, and str() shows the structure of the data frame.

summary(druguse) #Summary of the data frame
##   agegroup      gender      education                      country    
##  18-24:643   female:942   Min.   :-2.435910   Australia        :  54  
##  25-34:481   male  :943   1st Qu.:-0.611130   Canada           :  87  
##  35-44:356                Median :-0.059210   NewZealand       :   5  
##  45-54:294                Mean   :-0.003806   Other            : 118  
##  55-64: 93                3rd Qu.: 0.454680   RepublicofIreland:  20  
##  65+  : 18                Max.   : 1.984370   UK               :1044  
##                                               USA              : 557  
##              ethnicity     neuroticism         extraversion      
##  Asian            :  26   Min.   :-3.464360   Min.   :-3.273930  
##  Black            :  33   1st Qu.:-0.678250   1st Qu.:-0.695090  
##  Mixed-Black/Asian:   3   Median : 0.042570   Median : 0.003320  
##  Mixed-White/Asian:  20   Mean   : 0.000047   Mean   :-0.000163  
##  Mixed-White/Black:  20   3rd Qu.: 0.629670   3rd Qu.: 0.637790  
##  Other            :  63   Max.   : 3.273930   Max.   : 3.273930  
##  White            :1720                                          
##  opentoexperience    agreeableness       conscientiousness  
##  Min.   :-3.273930   Min.   :-3.464360   Min.   :-3.464360  
##  1st Qu.:-0.717270   1st Qu.:-0.606330   1st Qu.:-0.652530  
##  Median :-0.019280   Median :-0.017290   Median :-0.006650  
##  Mean   :-0.000534   Mean   :-0.000245   Mean   :-0.000386  
##  3rd Qu.: 0.723300   3rd Qu.: 0.760960   3rd Qu.: 0.584890  
##  Max.   : 2.901610   Max.   : 3.464360   Max.   : 3.464360  
##                                                             
##  impulsiveness         sensation            caffeine       chocolate    
##  Min.   :-2.555240   Min.   :-2.078480   Min.   :0.000   Min.   :0.000  
##  1st Qu.:-0.711260   1st Qu.:-0.525930   1st Qu.:5.000   1st Qu.:5.000  
##  Median :-0.217120   Median : 0.079870   Median :6.000   Median :5.000  
##  Mean   : 0.007216   Mean   :-0.003292   Mean   :5.484   Mean   :5.107  
##  3rd Qu.: 0.529750   3rd Qu.: 0.765400   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   : 2.901610   Max.   : 1.921730   Max.   :6.000   Max.   :6.000  
##                                                                         
##     nicotine        alcohol       amphetamine     amylnitrite    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:4.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :3.000   Median :5.000   Median :0.000   Median :0.0000  
##  Mean   :3.201   Mean   :4.635   Mean   :1.341   Mean   :0.6069  
##  3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.0000  
##                                                                  
##    benzodiaz        cannabis        cocaine          crack       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.000   Median :3.000   Median :0.000   Median :0.0000  
##  Mean   :1.465   Mean   :2.989   Mean   :1.161   Mean   :0.2976  
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.0000  
##                                                                  
##     ecstasy          heroin         ketamine        legalhighs   
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.000   Median :0.000   Median :0.0000   Median :0.000  
##  Mean   :1.314   Mean   :0.374   Mean   :0.5692   Mean   :1.356  
##  3rd Qu.:3.000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:3.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.0000   Max.   :6.000  
##                                                                  
##       LSD          methadone        mushrooms       volatiles     
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :0.000   Median :0.0000  
##  Mean   :1.062   Mean   :0.8265   Mean   :1.187   Mean   :0.4334  
##  3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :6.000   Max.   :6.0000   Max.   :6.000   Max.   :6.0000  
##                                                                   
##     any          severity     UseLevel   
##  FALSE: 195   Min.   : 0.00   low : 849  
##  TRUE :1690   1st Qu.: 5.00   high:1036  
##               Median :16.00              
##               Mean   :18.19              
##               3rd Qu.:29.00              
##               Max.   :64.00              
## 
str(druguse) #Show structure of the data frame
## 'data.frame':    1885 obs. of  33 variables:
##  $ agegroup         : Factor w/ 6 levels "18-24","25-34",..: 1 1 2 2 3 1 1 2 3 1 ...
##  $ gender           : Factor w/ 2 levels "female","male": 1 2 1 2 1 1 2 1 1 2 ...
##  $ education        : num  0.455 -0.611 1.164 1.164 0.455 ...
##  $ country          : Factor w/ 7 levels "Australia","Canada",..: 6 7 6 6 6 7 7 6 6 6 ...
##  $ ethnicity        : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ neuroticism      : num  -0.1488 -0.921 1.0212 -1.5508 -0.0519 ...
##  $ extraversion     : num  -0.948 1.741 0.962 1.114 0.962 ...
##  $ opentoexperience : num  -1.4242 0.5833 -0.9763 -0.0193 -1.119 ...
##  $ agreeableness    : num  -0.4532 -1.3429 -0.6063 -0.0173 0.9416 ...
##  $ conscientiousness: num  0.26 1.134 -0.143 1.462 1.631 ...
##  $ impulsiveness    : num  0.53 0.53 -1.38 0.193 -1.38 ...
##  $ sensation        : num  -0.8464 1.2247 -1.1808 0.0799 -2.0785 ...
##  $ caffeine         : num  4 6 6 6 5 6 5 6 6 5 ...
##  $ chocolate        : num  5 5 6 5 6 6 4 6 6 6 ...
##  $ nicotine         : num  2 6 5 4 0 6 4 3 0 4 ...
##  $ alcohol          : num  4 4 5 5 3 5 3 3 5 5 ...
##  $ amphetamine      : num  0 3 0 0 0 0 2 0 0 4 ...
##  $ amylnitrite      : num  0 0 2 0 0 0 3 0 0 2 ...
##  $ benzodiaz        : num  0 2 0 0 0 0 3 0 0 3 ...
##  $ cannabis         : num  2 5 3 2 0 6 5 0 1 4 ...
##  $ cocaine          : num  0 3 2 0 0 0 2 0 0 2 ...
##  $ crack            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ecstasy          : num  0 0 0 0 0 0 3 0 0 4 ...
##  $ heroin           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ketamine         : num  0 0 0 0 0 0 0 0 0 4 ...
##  $ legalhighs       : num  0 3 2 4 0 5 3 0 0 3 ...
##  $ LSD              : num  0 0 0 0 0 0 3 0 0 2 ...
##  $ methadone        : num  0 0 0 0 0 2 4 0 0 0 ...
##  $ mushrooms        : num  0 3 0 0 0 3 3 0 0 2 ...
##  $ volatiles        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ any              : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 1 2 2 2 2 2 ...
##  $ severity         : num  4 25 14 10 0 22 35 3 1 34 ...
##  $ UseLevel         : Factor w/ 2 levels "low","high": 1 2 2 1 1 2 2 1 1 2 ...

Using pairs() we can plot every pairwise scatter plot of the variables to show any relationships between them visually.

pairs(druguse, gap=0)#apply pairs to data frame 

We can better visualise the relationships shown by pairs() by calculating the correlation matrix. We first convert the whole data frame to numeric using lapply() and as.numeric() so that we can compute a correlation matrix with cor(). We then reshape the matrix using melt() from the reshape2 package so that we can plot a heatmap of the correlogram in ggplot.

druguse_num <- data.frame(lapply(druguse,as.numeric)) #convert all columns to a numeric
correlation <- cor(druguse_num) #calculate correlations
melted_cor <- melt(correlation) #use melt to reshape correlation matrix into a dataframe for ggplot

#create a heatmap of the correlation matrix in ggplot
corplot <- ggplot(data=melted_cor, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + xlab("") + ylab("") + ggtitle("Correlation Heatmap") #create plot using geom_tile and assign labels and title
corplot + geom_text(aes(Var2, Var1, label = round(value,2)),
                    color = "black", size = 2) + theme(axis.text.x = element_text(angle = 90),
                                                       plot.title = element_text(hjust = 0.5)) + scale_fill_gradientn(colours = rainbow(20)) #rotate the axis text, centre the title and add correlation values onto each tile

This allows us to easily visualise the relationships in the data so that we can determine what to use as a predictor. We can order the correlation values, remove duplicates and self-correlations, then use head() and tail() to show the most positively and most negatively correlated pairs.

melted_cor_ord1 <- melted_cor[order(melted_cor$value ,decreasing = T),] #order correlations
melted_cor_ord2 <- melted_cor_ord1[seq(34,nrow(melted_cor_ord1),2),] #remove duplicates and correlation with self
head(melted_cor_ord2, n=10) #show most positively correlated 
##           Var1        Var2     value
## 1056  UseLevel    severity 0.8279854
## 758   severity     ecstasy 0.7517185
## 660   UseLevel    cannabis 0.7464855
## 659   severity    cannabis 0.7458712
## 692   severity     cocaine 0.7287190
## 560   severity amphetamine 0.7244818
## 857   severity  legalhighs 0.7068349
## 956   severity   mushrooms 0.7036681
## 890   severity         LSD 0.6713733
## 887  mushrooms         LSD 0.6686274
tail(melted_cor_ord2, n=10) #show most negatively correlated
##                  Var1              Var2      value
## 29          mushrooms          agegroup -0.3267421
## 12          sensation          agegroup -0.3275861
## 308     impulsiveness conscientiousness -0.3351333
## 33           UseLevel          agegroup -0.3768257
## 23            ecstasy          agegroup -0.3822633
## 175 conscientiousness       neuroticism -0.3910884
## 32           severity          agegroup -0.4072982
## 26         legalhighs          agegroup -0.4100335
## 172      extraversion       neuroticism -0.4310511
## 20           cannabis          agegroup -0.4370538

Using rcorr() from Hmisc we can calculate a p value for each correlation pair to determine whether the correlation is statistically significant at a given significance level. As with the correlogram, we can then create a heatmap of the p values using ggplot.

#use rcorr from Hmisc to calculate p values for the correlation of all elements and extract the p values 
cor_pvalues <- rcorr(as.matrix(druguse_num))$P
#melt to reshape for ggplot
melted_cor_pvalues <- melt(cor_pvalues)
melted_cor_pvalues[is.na(melted_cor_pvalues)] <- 0 #replace NAs with 0
#create a heatmap of the correlation pvalues matrix in ggplot
corplot <- ggplot(data=melted_cor_pvalues, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + xlab("") + ylab("") +ggtitle("Correlation Pvalues Heatmap")
corplot + geom_text(aes(Var2, Var1, label = round(value,2)),
                    color = "black", size = 2) + theme(axis.text.x = element_text(angle = 90),
                                                       plot.title = element_text(hjust = 0.5)) + scale_fill_gradientn(colours = c("blue","red")) 

From this we can clearly see the high p values, and thus which pairs are not significantly related. We can extract the correlation pairs involving UseLevel only, to see its relationship with the predictors in the data set.

cor_UseLevel <- melted_cor_ord2[(melted_cor_ord2$Var1 == "UseLevel"),] #keep only correlations with UseLevel
cor_UseLevel #print
##          Var1              Var2       value
## 1056 UseLevel          severity  0.82798538
## 660  UseLevel          cannabis  0.74648553
## 759  UseLevel           ecstasy  0.65682304
## 858  UseLevel        legalhighs  0.61336785
## 957  UseLevel         mushrooms  0.61169870
## 693  UseLevel           cocaine  0.59065875
## 561  UseLevel       amphetamine  0.58671220
## 891  UseLevel               LSD  0.56739033
## 495  UseLevel          nicotine  0.53958655
## 627  UseLevel         benzodiaz  0.49578385
## 396  UseLevel         sensation  0.47205494
## 825  UseLevel          ketamine  0.40937284
## 924  UseLevel         methadone  0.40528902
## 1023 UseLevel               any  0.37523210
## 264  UseLevel  opentoexperience  0.37108294
## 594  UseLevel       amylnitrite  0.34200580
## 363  UseLevel     impulsiveness  0.33004475
## 990  UseLevel         volatiles  0.32468935
## 726  UseLevel             crack  0.31812696
## 66   UseLevel            gender  0.31716181
## 792  UseLevel            heroin  0.31697286
## 198  UseLevel       neuroticism  0.13967826
## 165  UseLevel         ethnicity  0.08650985
## 429  UseLevel          caffeine  0.06388204
## 132  UseLevel           country  0.00788457
## 528  UseLevel           alcohol -0.01672327
## 231  UseLevel      extraversion -0.03882249
## 462  UseLevel         chocolate -0.08955852
## 297  UseLevel     agreeableness -0.18077992
## 99   UseLevel         education -0.25774412
## 330  UseLevel conscientiousness -0.31499336
## 33   UseLevel          agegroup -0.37682565

We can do the same for the p values to determine which correlation pairs involving UseLevel are statistically significant.

#order p values highest to lowest
melted_cor_pvalues_ord <- melted_cor_pvalues[order(melted_cor_pvalues$value ,decreasing = T),]
head(melted_cor_pvalues_ord, 10)#show highest p values 
##                  Var1              Var2     value
## 141     agreeableness         ethnicity 0.9937756
## 269         ethnicity     agreeableness 0.9937756
## 313           alcohol conscientiousness 0.9929622
## 505 conscientiousness           alcohol 0.9929622
## 311         chocolate conscientiousness 0.9876450
## 439 conscientiousness         chocolate 0.9876450
## 162         volatiles         ethnicity 0.9818996
## 962         ethnicity         volatiles 0.9818996
## 245         chocolate  opentoexperience 0.9571145
## 437  opentoexperience         chocolate 0.9571145
tail(melted_cor_pvalues_ord, 10)#show lowest p values
##            Var1     Var2 value
## 1080     heroin UseLevel     0
## 1081   ketamine UseLevel     0
## 1082 legalhighs UseLevel     0
## 1083        LSD UseLevel     0
## 1084  methadone UseLevel     0
## 1085  mushrooms UseLevel     0
## 1086  volatiles UseLevel     0
## 1087        any UseLevel     0
## 1088   severity UseLevel     0
## 1089   UseLevel UseLevel     0

We can then choose a significance level to test at, in this case 5%, to show which predictors are unlikely to be correlated with UseLevel.

#find p values relating to use level
pvalues_cor_uselevel <- melted_cor_pvalues_ord[melted_cor_pvalues_ord$Var1 == "UseLevel",] #extract just those relating to UseLevel
notsignificant <- pvalues_cor_uselevel[pvalues_cor_uselevel$value >= 0.05,] #not significant at 5% significance level
pvalues_cor_uselevel #print
##          Var1              Var2        value
## 132  UseLevel           country 7.322761e-01
## 528  UseLevel           alcohol 4.680617e-01
## 231  UseLevel      extraversion 9.197753e-02
## 429  UseLevel          caffeine 5.528074e-03
## 165  UseLevel         ethnicity 1.695587e-04
## 462  UseLevel         chocolate 9.877888e-05
## 198  UseLevel       neuroticism 1.127642e-09
## 297  UseLevel     agreeableness 2.664535e-15
## 33   UseLevel          agegroup 0.000000e+00
## 66   UseLevel            gender 0.000000e+00
## 99   UseLevel         education 0.000000e+00
## 264  UseLevel  opentoexperience 0.000000e+00
## 330  UseLevel conscientiousness 0.000000e+00
## 363  UseLevel     impulsiveness 0.000000e+00
## 396  UseLevel         sensation 0.000000e+00
## 495  UseLevel          nicotine 0.000000e+00
## 561  UseLevel       amphetamine 0.000000e+00
## 594  UseLevel       amylnitrite 0.000000e+00
## 627  UseLevel         benzodiaz 0.000000e+00
## 660  UseLevel          cannabis 0.000000e+00
## 693  UseLevel           cocaine 0.000000e+00
## 726  UseLevel             crack 0.000000e+00
## 759  UseLevel           ecstasy 0.000000e+00
## 792  UseLevel            heroin 0.000000e+00
## 825  UseLevel          ketamine 0.000000e+00
## 858  UseLevel        legalhighs 0.000000e+00
## 891  UseLevel               LSD 0.000000e+00
## 924  UseLevel         methadone 0.000000e+00
## 957  UseLevel         mushrooms 0.000000e+00
## 990  UseLevel         volatiles 0.000000e+00
## 1023 UseLevel               any 0.000000e+00
## 1056 UseLevel          severity 0.000000e+00
## 1089 UseLevel          UseLevel 0.000000e+00
notsignificant #print
##         Var1         Var2      value
## 132 UseLevel      country 0.73227613
## 528 UseLevel      alcohol 0.46806166
## 231 UseLevel extraversion 0.09197753

Using plotly we can create an interactive 3D scatter plot with a symbol for each UseLevel and three significant predictors of use level on the axes. The plot appears to show two groups, supporting what we inferred previously from the correlations and p values.

#create a new data frame, all numerical except UseLevel, otherwise we can't add a legend
druguse_num_p <- druguse_num
druguse_num_p$UseLevel <- druguse$UseLevel
#create 3d plot of three significant predictors
sp3 <- plot_ly(druguse_num_p, x = ~nicotine, y = ~sensation, z= ~opentoexperience, mode = 'markers',
               color = I('blue'), symbol = ~UseLevel, symbols = c('circle','o')) %>%#create plot and set x,y,z
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Nicotine'),
                     yaxis = list(title = 'Sensation'),
                     zaxis = list(title = 'Open to experience')), title="3D plot of 3 significant predictors of UseLevel") #set axis labels
sp3

Similarly to before we can extract the p values for severity to show which predictors are statistically significant for predicting severity.

pvalues_cor_severity <- melted_cor_pvalues_ord[melted_cor_pvalues_ord$Var1 == "severity",]#find p values relating to severity
pvalues_cor_severity #print
##          Var1              Var2        value
## 131  severity           country 3.624009e-01
## 230  severity      extraversion 1.618099e-01
## 527  severity           alcohol 1.128850e-01
## 428  severity          caffeine 1.795644e-03
## 164  severity         ethnicity 1.184769e-04
## 461  severity         chocolate 1.181627e-04
## 197  severity       neuroticism 8.881784e-16
## 32   severity          agegroup 0.000000e+00
## 65   severity            gender 0.000000e+00
## 98   severity         education 0.000000e+00
## 263  severity  opentoexperience 0.000000e+00
## 296  severity     agreeableness 0.000000e+00
## 329  severity conscientiousness 0.000000e+00
## 362  severity     impulsiveness 0.000000e+00
## 395  severity         sensation 0.000000e+00
## 494  severity          nicotine 0.000000e+00
## 560  severity       amphetamine 0.000000e+00
## 593  severity       amylnitrite 0.000000e+00
## 626  severity         benzodiaz 0.000000e+00
## 659  severity          cannabis 0.000000e+00
## 692  severity           cocaine 0.000000e+00
## 725  severity             crack 0.000000e+00
## 758  severity           ecstasy 0.000000e+00
## 791  severity            heroin 0.000000e+00
## 824  severity          ketamine 0.000000e+00
## 857  severity        legalhighs 0.000000e+00
## 890  severity               LSD 0.000000e+00
## 923  severity         methadone 0.000000e+00
## 956  severity         mushrooms 0.000000e+00
## 989  severity         volatiles 0.000000e+00
## 1022 severity               any 0.000000e+00
## 1055 severity          severity 0.000000e+00
## 1088 severity          UseLevel 0.000000e+00

We can create a 2D interactive histogram contour plot of severity and age group to show how they are distributed relative to one another.

hc1 <- subplot(#use subplot so we can plot the histograms along with the contours
  plot_ly(data=druguse,x =~agegroup, type = 'histogram', name = 'Age group'), 
  plotly_empty(), 
  plot_ly(druguse,x =~agegroup, y =~severity, type = 'histogram2dcontour', showscale = T, name = 'Contour'), 
  plot_ly(druguse,y =~severity, type = 'histogram', name = 'Severity'),
  nrows = 2, heights = c(0.2, 0.8), widths = c(0.8, 0.2), 
  shareX = TRUE, shareY = TRUE, titleX = FALSE, titleY = FALSE
)

hc2 <- layout(hc1, showlegend = F, xaxis = list(title = 'Age group'), yaxis = list(title = 'Severity'), title = 'Histogram contour of severity and age group') #add title and axis labels

hc2 #show plot

This shows how the two variables are distributed relative to one another, with the contours showing clusters in the data. We can show how each predictor is distributed individually by plotting a histogram of each.

par(mfrow=c(4,4))#set 4x4 space in plotting device
for (n in 1:16){#loop for first 16 predictors
  t <- sprintf('Histogram of %s', colnames(druguse_num)[n])#create names for title
  hist(druguse_num[,n], main = t, xlab = sprintf('%s', colnames(druguse_num)[n]))#create histogram
}

Question 2

Question 2.1

Create validation and training datasets.

druguse_predictors <- druguse[c(1:16,33)] #Data set of just the predictors
druguse_train <- druguse_predictors[1:1400,] #Training dataset  
druguse_val <- druguse_predictors[1401:1885,] #Test dataset

Create a logistic regression model using glm(), specifying family=binomial(link='logit') for logistic regression, and output a summary of the model.

model <- glm(UseLevel ~ ., family=binomial(link='logit'), data=druguse_train)#Create a logistic regression classifier using all predictors to predict UseLevel
summary(model) #Summarise the model
## 
## Call:
## glm(formula = UseLevel ~ ., family = binomial(link = "logit"), 
##     data = druguse_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3880  -0.3865   0.1221   0.4357   2.8387  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -0.84266    1.13278  -0.744  0.45694    
## agegroup25-34                 0.26901    0.24571   1.095  0.27359    
## agegroup35-44                -0.17433    0.25605  -0.681  0.49598    
## agegroup45-54                -0.65564    0.27445  -2.389  0.01690 *  
## agegroup55-64                -0.98729    0.40893  -2.414  0.01576 *  
## agegroup65+                 -16.81790  502.60332  -0.033  0.97331    
## gendermale                    0.93652    0.18530   5.054 4.33e-07 ***
## education                    -0.25677    0.09947  -2.581  0.00984 ** 
## countryCanada                -0.81196    0.62733  -1.294  0.19556    
## countryNewZealand            -0.43152    1.48226  -0.291  0.77096    
## countryOther                 -1.00621    0.61469  -1.637  0.10164    
## countryRepublicofIreland     -1.61533    0.95965  -1.683  0.09233 .  
## countryUK                    -2.10637    0.51748  -4.070 4.69e-05 ***
## countryUSA                    0.07998    0.54650   0.146  0.88364    
## ethnicityBlack               -0.46926    1.21613  -0.386  0.69960    
## ethnicityMixed-Black/Asian   13.17401 2399.54491   0.005  0.99562    
## ethnicityMixed-White/Asian    1.34439    1.34799   0.997  0.31860    
## ethnicityMixed-White/Black    0.61315    1.13218   0.542  0.58812    
## ethnicityOther                0.41618    0.97087   0.429  0.66816    
## ethnicityWhite                1.06211    0.85269   1.246  0.21291    
## neuroticism                  -0.06843    0.10697  -0.640  0.52234    
## extraversion                 -0.13871    0.11164  -1.242  0.21406    
## opentoexperience              0.56631    0.10678   5.304 1.14e-07 ***
## agreeableness                -0.09095    0.09740  -0.934  0.35041    
## conscientiousness            -0.31084    0.10487  -2.964  0.00304 ** 
## impulsiveness                -0.01198    0.11249  -0.107  0.91517    
## sensation                     0.72043    0.13213   5.453 4.96e-08 ***
## caffeine                      0.07957    0.08073   0.986  0.32431    
## chocolate                    -0.09428    0.08338  -1.131  0.25815    
## nicotine                      0.48841    0.03924  12.447  < 2e-16 ***
## alcohol                      -0.06487    0.06890  -0.942  0.34642    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1923.39  on 1399  degrees of freedom
## Residual deviance:  894.64  on 1369  degrees of freedom
## AIC: 956.64
## 
## Number of Fisher Scoring iterations: 15

Looking at the summary of the model we can use the coefficient of each predictor to determine whether a smoker is more or less likely to have a high use level. The coefficient of nicotine is 0.48841; as this is positive it suggests smokers are more likely, and as the p value is low this relationship is statistically significant. Chocolate has a coefficient of -0.09428, which is negative, suggesting chocolate eaters are less likely to have a high use level; however, its high p value suggests this may not be statistically significant and that in fact there is no relationship.
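To make the coefficient interpretation concrete: logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. A quick aside using the fitted values quoted above:

```r
# Logistic regression coefficients are log-odds; exp() converts them to odds ratios
exp(0.48841)  # each unit of nicotine multiplies the odds of high use by about 1.63
exp(-0.09428) # each unit of chocolate multiplies the odds by about 0.91
```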

Question 2.2

Create predictions using the predict function; if a prediction is greater than 0.5 assign it the value 1, indicating high, and 0 otherwise, indicating low. Then count the errors of each type and the correct predictions, and create a table displaying them.

pp=predict(model, newdata=druguse_val)#Create predictions 
mypreds=ifelse(pp>0.5, 1,0) # predict High if pp > 0.5 and predict Low otherwise.

#calculate errors of each type
TP=sum(mypreds==1 & druguse_val$UseLevel=="high")
FP=sum(mypreds==1 & druguse_val$UseLevel=="low")
TN=sum(mypreds==0 & druguse_val$UseLevel=="low")
FN=sum(mypreds==0 & druguse_val$UseLevel=="high")

#make table
results <- matrix(c(TP,FP,FN,TN), ncol=2, byrow=TRUE)
#assign names
rownames(results) <- c("Predict High","Predict Low")
colnames(results) <- c("Actually High", "Actually Low")
results <- as.table(results)
results #print
##              Actually High Actually Low
## Predict High           207           26
## Predict Low             51          201
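An aside on the threshold above: predict() on a glm returns the linear predictor (log-odds) by default rather than probabilities, so cutting at 0.5 on that scale is equivalent to a probability cutoff of plogis(0.5), not 0.5. A minimal sketch of the two scales:

```r
pp_link <- 0.5              # a prediction on the log-odds (link) scale
pp_prob <- plogis(pp_link)  # the same prediction converted to a probability
pp_prob                     # about 0.622, slightly stricter than a 0.5 probability cutoff
```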

Question 2.3

Determine the accuracy.

Accuracy <- (TP+TN)/(TP+TN+FN+FP); Accuracy #calculate accuracy and print
## [1] 0.8412371

Create a prediction object and use performance() to create the ROC curve.

pr <- prediction(pp, as.numeric(druguse_val$UseLevel)) # create the prediction object
prf <- performance(pr, measure = "tpr", x.measure = "fpr") # find the performance
plot(prf, main="ROC of logistic regression model") # plot it

We can also calculate the area under the ROC curve (the AUC) as a quantitative measure of the performance of the model.

#calculate area under the curve
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc 
## [1] 0.926903

The ROC curve suggests it is a good model: the closer the curve is to the top left corner the better, as this indicates a high true positive rate with a low false positive rate, which is what we want. The AUC is the area under this curve; as it is close to one, it also suggests the model is a good predictor, since an AUC of 1 corresponds to a perfect model.
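The AUC also has a useful probabilistic reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch on synthetic scores (not the druguse data):

```r
set.seed(42)
scores_pos <- rnorm(200, mean = 1)  # model scores for actual high-use cases
scores_neg <- rnorm(200, mean = 0)  # model scores for actual low-use cases
# AUC = P(score of a random positive > score of a random negative)
auc_manual <- mean(outer(scores_pos, scores_neg, ">"))
auc_manual  # well above 0.5, since positives tend to score higher
```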

Question 2.4

We can split the data into 10 equal-sized pieces, train on 9 of them and validate on 1, repeating until we've used all 10 folds as the validation data. Averaging the accuracy over the 10 iterations estimates the accuracy of the model on a new data set.

folds <- cut(seq(1,nrow(druguse_predictors)),breaks=10,labels=FALSE)#cut data into 10 blocks
Accuracy <- c()#create place to store accuracies in memory
for(i in 1:10){
    testIndexes <- which(folds==i,arr.ind=TRUE)#create indices for test and training data
    testingData <- druguse_predictors[testIndexes, ]#create test data
    trainingData <- druguse_predictors[-testIndexes, ]#create training data
    model <- glm(UseLevel ~ ., family=binomial(link='logit'), data=trainingData)#Create a logistic regression classifier using all predictors to predict UseLevel
    pp=predict(model, newdata=testingData)#Create predictions 
    mypreds=ifelse(pp>0.5, 1,0) # predict High if pp > 0.5 and predict Low otherwise.
    
    #calculate errors and correct predictions
    TP=sum(mypreds==1 & testingData$UseLevel=="high")
    FP=sum(mypreds==1 & testingData$UseLevel=="low")
    TN=sum(mypreds==0 & testingData$UseLevel=="low")
    FN=sum(mypreds==0 & testingData$UseLevel=="high")

    Accuracy[i] <- (TP+TN)/(TP+TN+FN+FP)#store accuracy
}
summary(Accuracy)#use summary to see the average accuracy and other metrics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7619  0.8320  0.8409  0.8372  0.8537  0.8677

This provides a look at how the model will perform in general, with the average and metrics such as the max, min and quartiles showing how much the performance could vary on new data sets.

Question 3

Question 3.1

First we create a numerical version of the data set, removing the predictors with a high p value that we discovered in our data analysis. We also scale the data, as this helps improve the accuracy of our classifiers and the speed at which methods like neural nets converge.

#remove predictors with high p value
druguse_num <- druguse_num[,c(1:3,5,6,8:15)]
#scale data using scale function and recreate dataframe
druguse_num_s <- data.frame(lapply(druguse_num, scale))
#add drug UseLevel column to dataframe
druguse_num_s$UseLevel <- druguse$UseLevel

Bootstrapping

We will now use bootstrapping to estimate the performance of each classifier on a new dataset. We randomly select a sample with replacement for training and then validate on the rest of the data set (the out-of-bag observations).

set.seed(1729)#set seed to Hardy-Ramanujan number so results using bootstrapping are consistent

Naive Bayes

The naive Bayes classifier assumes the predictors are conditionally independent; our data analysis suggests this may not be true, which could affect the performance of this classifier. Using Bayes' theorem we calculate the relative probability of a new observation vector being of class high or low given the values in the training set, and classify it to whichever class has the highest probability.
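As a minimal sketch of this idea on synthetic data (not the druguse set): with a single Gaussian predictor and equal priors, the posterior is proportional to the class-conditional likelihood, so we classify by comparing likelihoods.

```r
set.seed(1)
x_low  <- rnorm(100, mean = 0)  # training values observed in the "low" class
x_high <- rnorm(100, mean = 2)  # training values observed in the "high" class
classify <- function(x) {
  # posterior is proportional to prior * likelihood; priors are equal here,
  # so we just compare the Gaussian likelihoods fitted to each class
  lik_low  <- dnorm(x, mean(x_low),  sd(x_low))
  lik_high <- dnorm(x, mean(x_high), sd(x_high))
  ifelse(lik_high > lik_low, "high", "low")
}
classify(c(-0.5, 2.5))  # "low" "high"
```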

Accuracybayes <- matrix(NA,nrow=10,ncol=1) #create place to store data
colnames(Accuracybayes) <- "bayes"#name column
for(i in 1:10){
    bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
    outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
    
    modelbayes <- naiveBayes(UseLevel ~ ., data=druguse_num_s[bootsample,])#Create a naive bayes classifier for uselevel
    pbayes <- predict(modelbayes, newdata=druguse_num_s[outofbag,])#Create predictions
    
    Accuracybayes[i,1] <- sum(pbayes==druguse_num_s$UseLevel[outofbag])/length(outofbag)#calculate accuracy
}
summary(Accuracybayes)#summarise accuracy
##      bayes       
##  Min.   :0.7850  
##  1st Qu.:0.8069  
##  Median :0.8197  
##  Mean   :0.8213  
##  3rd Qu.:0.8333  
##  Max.   :0.8580

Naive Bayes provides a high mean accuracy; however, techniques that make fewer assumptions may offer higher accuracy.

Linear Discriminant Analysis

Linear discriminant analysis assumes each class follows a multivariate Gaussian distribution with a common covariance matrix. It uses the training data to estimate the parameters of each class's Gaussian by maximum likelihood. The decision boundaries are the lines perpendicular to the lines joining the centroids of each class's Gaussian. Because the covariance matrices are assumed equal, this essentially classifies a new data point to whichever class has the highest probability.
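A minimal sketch of the centroid idea on synthetic data (not the druguse set), assuming for simplicity that the shared covariance is the identity, in which case classification with equal priors reduces to assigning each point to its nearest class centroid:

```r
set.seed(2)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # 50 points around (0, 0): class "low"
           matrix(rnorm(100, mean = 3), ncol = 2))  # 50 points around (3, 3): class "high"
y <- rep(c("low", "high"), each = 50)
centroids <- rbind(low  = colMeans(X[y == "low", ]),
                   high = colMeans(X[y == "high", ]))
# with equal priors and identity covariance, assign to the nearest centroid
nearest <- function(x) {
  d <- apply(centroids, 1, function(m) sum((m - x)^2))  # squared distance to each centroid
  names(which.min(d))
}
nearest(c(0.5, 0.5))  # "low"
```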

Accuracylda <- matrix(NA,nrow=10,ncol=1) #create place to store data
colnames(Accuracylda) <-"LDA"#name column
for(i in 1:10){
    bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
    outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
    
    modellda <- lda(UseLevel ~ ., data=druguse_num_s[bootsample,])#Create a linear discriminant analysis classifier using the predictors to predict UseLevel
    pplda=predict(modellda, newdata=druguse_num_s[outofbag,])#Create predictions 

    Accuracylda[i] <- sum(pplda$class==druguse_num_s$UseLevel[outofbag])/length(outofbag)#calculate accuracy
}
summary(Accuracylda)#summarise accuracy
##       LDA        
##  Min.   :0.8131  
##  1st Qu.:0.8307  
##  Median :0.8378  
##  Mean   :0.8356  
##  3rd Qu.:0.8432  
##  Max.   :0.8487

LDA offers a higher accuracy than naive Bayes. We do have to assume a Gaussian distribution, but our data analysis suggests most predictors are approximately Gaussian, so a multivariate Gaussian distribution for each class may be a reasonable assumption for predicting use level. The range of accuracies is also narrower.

Mixture Discriminant Analysis

MDA assumes a Gaussian distribution like LDA, but splits each class into subclasses that are each assumed to be Gaussian and estimates the parameters of these Gaussians by maximum likelihood. Splitting the classes into subclasses can allow MDA to create better decision boundaries than LDA.
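The class-conditional density MDA works with is a mixture of subclass Gaussians; a minimal sketch with two assumed subclasses:

```r
# Density of one class modelled as a two-component Gaussian mixture
# (subclass means, weights and unit variance assumed for illustration)
means <- c(-1, 1); weights <- c(0.5, 0.5)
mix_density <- function(x) sum(weights * dnorm(x, mean = means, sd = 1))
mix_density(0) # density of the class at x = 0
```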

Accuracymda <- matrix(NA, nrow = 10, ncol = 5)#create place to store accuracies
colnames(Accuracymda) <- sprintf("MDA subclasses=%s", 1:5)#name columns according to number of subclasses
for (r in 1:5){#test different numbers of subclasses
  for(i in 1:10){
      bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
      outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
      
      modelmda <- mda(UseLevel ~ ., data=druguse_num_s[bootsample,], subclasses = r)#Create a mda classifier to predict UseLevel with r subclasses
      pmda <- predict(modelmda, newdata=druguse_num_s[outofbag,])#Create predictions 
      
      Accuracymda[i,r] <- sum(pmda==druguse_num_s$UseLevel[outofbag])/length(outofbag)#calculate accuracy
  }
}
summary(Accuracymda)#summarise accuracy
##  MDA subclasses=1 MDA subclasses=2 MDA subclasses=3 MDA subclasses=4
##  Min.   :0.8152   Min.   :0.8084   Min.   :0.8097   Min.   :0.8009  
##  1st Qu.:0.8216   1st Qu.:0.8236   1st Qu.:0.8237   1st Qu.:0.8130  
##  Median :0.8315   Median :0.8332   Median :0.8347   Median :0.8348  
##  Mean   :0.8281   Mean   :0.8304   Mean   :0.8341   Mean   :0.8296  
##  3rd Qu.:0.8341   3rd Qu.:0.8360   3rd Qu.:0.8474   3rd Qu.:0.8429  
##  Max.   :0.8379   Max.   :0.8443   Max.   :0.8500   Max.   :0.8605  
##  MDA subclasses=5
##  Min.   :0.8072  
##  1st Qu.:0.8199  
##  Median :0.8268  
##  Mean   :0.8278  
##  3rd Qu.:0.8356  
##  Max.   :0.8447

MDA has a similar average accuracy to LDA but a wider range, suggesting that splitting into subclasses has caused the model to overfit the training data in some cases. Three subclasses seems to provide the best accuracy, but the accuracies are similar in each case so this difference may not be significant; the range of accuracies also increases with the number of subclasses, which again points to overfitting.

Support Vector Machines

SVMs find a hyperplane to separate the two classes, chosen to maximise the margin: the perpendicular distance between the hyperplane and the nearest data points of each class. One advantage of SVMs is that you can change the shape of the decision boundary by using a kernel, allowing for non-linear boundaries. In this case we test polynomial, radial and sigmoid kernels as well as linear to see if they offer better accuracy.
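The four kernels being compared can be written out as plain R functions (the gamma, degree and coef0 values here are illustrative, not the tuned defaults svm() would use):

```r
# Kernel functions: each measures the similarity of two observation vectors
linear_k  <- function(x, y) sum(x * y)
poly_k    <- function(x, y, degree = 3, coef0 = 0) (sum(x * y) + coef0)^degree
radial_k  <- function(x, y, gamma = 1) exp(-gamma * sum((x - y)^2))
sigmoid_k <- function(x, y, gamma = 1, coef0 = 0) tanh(gamma * sum(x * y) + coef0)
x <- c(1, 0); y <- c(0.5, 0.5)
c(linear = linear_k(x, y), radial = radial_k(x, y)) # 0.5 and exp(-0.5)
```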

Accuracysvms <- matrix(NA, nrow = 10, ncol = 4)#allocate memory to store accuracies
kernels <- c("linear", "polynomial", "radial", "sigmoid")#store kernel types so models can be created in for loop
colnames(Accuracysvms) <- sprintf("SVM %s", kernels) #name columns of accuracies matrix
for (r in 1:4){#repeat for each kernel type
  for (i in 1:10){#repeat 10 times
    bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
    outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
    
    modelsvms <- svm(UseLevel~.,data = druguse_num_s[bootsample,], kernel=kernels[r])#create svm with kernel type kernels[r]
    psvms <- predict(modelsvms, newdata=druguse_num_s[outofbag,])#predict on validation data
    
    Accuracysvms[i,r] <- sum(psvms==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy 
  }
}
summary(Accuracysvms)#show summary of accuracies
##    SVM linear     SVM polynomial     SVM radial      SVM sigmoid    
##  Min.   :0.8088   Min.   :0.8039   Min.   :0.8084   Min.   :0.7556  
##  1st Qu.:0.8293   1st Qu.:0.8077   1st Qu.:0.8199   1st Qu.:0.7611  
##  Median :0.8341   Median :0.8138   Median :0.8311   Median :0.7739  
##  Mean   :0.8361   Mean   :0.8152   Mean   :0.8293   Mean   :0.7738  
##  3rd Qu.:0.8424   3rd Qu.:0.8197   3rd Qu.:0.8374   3rd Qu.:0.7820  
##  Max.   :0.8728   Max.   :0.8302   Max.   :0.8502   Max.   :0.8000

To compare each kernel we can create a boxplot of the accuracies of each kernel type.

bp.svms <- boxplot(Accuracysvms, horizontal=T, main="SVM kernel accuracies", ylab="Kernel", xlab="Accuracy", col=c(2,5,7,8)) #create boxplot comparing kernels

Out of the 4 kernels the linear kernel performs the best on average but offers similar performance to LDA.

Neural Networks

We can use a neural network to model the relationship between the predictors and use level. A neural network uses the training data to learn a function relating the predictors to use level, adjusting the weights of each node in the hidden layer at each iteration until it converges to a particular set of weights. We can create a neural network with 10 nodes in the hidden layer and train it using nnet on the training data.

#create validation and training data from scaled data
druguse_train_num_s <- druguse_num_s[1:1400,]
druguse_val_num_s <- druguse_num_s[1401:1885,]

#create a neural network
nnmodel <- nnet(UseLevel~. , data=druguse_train_num_s, size = 10, maxit=1000) 
## # weights:  151
## initial  value 1004.917054 
## iter  10 value 503.141443
## iter  20 value 455.570938
## iter  30 value 416.521479
## iter  40 value 391.345281
## iter  50 value 374.640870
## iter  60 value 349.889873
## iter  70 value 328.538846
## iter  80 value 302.407847
## iter  90 value 280.595727
## iter 100 value 271.184208
## iter 110 value 267.644844
## iter 120 value 267.076157
## iter 130 value 267.026594
## iter 140 value 267.021381
## iter 150 value 267.017322
## iter 160 value 267.016200
## final  value 267.016001 
## converged

Calculate the accuracy and create a confusion table of the predictions on the validation dataset to evaluate the performance.

#make predictions and create confusion matrix
nnpredn = predict(nnmodel,druguse_val_num_s,type="class")
tablenn <- table(nnpredn, druguse_val_num_s$UseLevel)
#calculate correct and incorrect classifications of each type, taking 'high' as positive    
FP=tablenn[1]#predicted high, actually low
TN=tablenn[2]#predicted low, actually low
TP=tablenn[3]#predicted high, actually high
FN=tablenn[4]#predicted low, actually high
#calculate accuracy 
Accuracynn <- (TP+TN)/(TP+TN+FN+FP)  
#print
Accuracynn
## [1] 0.771134
tablenn
##        
## nnpredn low high
##    high  55  202
##    low  172   56

The neural network has a lower accuracy than previous methods. We can plot the model using plotnet to visualise what the network looks like.

#plot neural network
plotnet(nnmodel)

We can vary the number of nodes in the hidden layer to see if we can improve the accuracy and perform bootstrapping to see if we can improve the accuracy on a new dataset.

Accuracynn <- matrix(NA ,nrow=10,ncol=4) #allocate memory to store accuracies
colnames(Accuracynn) <- sprintf("NN size=%s",(1:4)*5)#name columns
for (r in 1:4){#for loop for testing model sizes 5,10,15,20
  for (i in 1:10){#for loop for bootstrapping each model
    bootsample <- sample(nrow(druguse_num_s),nrow(druguse_num_s), replace = T)#take sample indices
    outofbag <- setdiff(1:nrow(druguse_num_s), bootsample)#create validation data indices
    #create model
    nnmodel <- nnet(UseLevel~. , data=druguse_num_s[bootsample,], size = (r*5), maxit=1000, trace=F) #Train model and don't output trace
    #predict on validation data and specify it as a classification prediction
    nnpredn <- predict(nnmodel, druguse_num_s[outofbag,], type="class")
    #calculate accuracy and store it
    Accuracynn[i,r] <- sum(nnpredn==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy     
  }
}
summary(Accuracynn) #summarise accuracies of each neural network
##    NN size=5        NN size=10       NN size=15       NN size=20    
##  Min.   :0.7669   Min.   :0.7454   Min.   :0.7387   Min.   :0.7455  
##  1st Qu.:0.7883   1st Qu.:0.7614   1st Qu.:0.7476   1st Qu.:0.7488  
##  Median :0.7948   Median :0.7668   Median :0.7668   Median :0.7562  
##  Mean   :0.7940   Mean   :0.7670   Mean   :0.7656   Mean   :0.7584  
##  3rd Qu.:0.8036   3rd Qu.:0.7708   3rd Qu.:0.7759   3rd Qu.:0.7696  
##  Max.   :0.8122   Max.   :0.7917   Max.   :0.8100   Max.   :0.7721

We can summarise the accuracies to see the performance of each model size. Five nodes seems to perform best on a new data set; however, the neural network performs somewhat worse than the previous classifiers.

KNN

We can use k-nearest neighbours, a non-parametric supervised method, to predict the use level. We give it a training set and it classifies a new point by majority vote among the classes of the k nearest points in the training set. We can use a for loop to try different values of k to find an optimal one.
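The classification rule can be sketched by hand on a toy one-dimensional training set (the data here are made up):

```r
# Hand-rolled k-nearest-neighbours vote for a single new point
train_x <- c(0, 0.1, 0.9, 1.0); train_y <- c("low", "low", "high", "high")
new_x <- 0.85; k <- 3
nearest <- order(abs(train_x - new_x))[1:k] # indices of the k closest points
votes <- table(train_y[nearest])            # tally the neighbours' classes
names(which.max(votes))                     # majority class: "high"
```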

druguse_num_s_unsupervised <- druguse_num_s[,-14]#create a dataframe without uselevel
AccuracyKNN <- matrix(NA, nrow = 10, ncol = 50)#allocate memory to store accuracies
colnames(AccuracyKNN) <- sprintf("KNN k=%s",1:50)
for (r in 1:50){#change k for each loop
  for (i in 1:10){#repeat 10 times
    bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
    outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
    
    pKNN <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=r)#create knn classifier and predictions
    
    
    AccuracyKNN[i,r] <- sum(pKNN==druguse_num_s[outofbag,]$UseLevel)/length(outofbag)#store accuracy 
  }
}
summary(AccuracyKNN)#show summary of accuracies
##     KNN k=1          KNN k=2          KNN k=3          KNN k=4      
##  Min.   :0.7367   Min.   :0.7295   Min.   :0.7454   Min.   :0.7415  
##  1st Qu.:0.7536   1st Qu.:0.7479   1st Qu.:0.7615   1st Qu.:0.7711  
##  Median :0.7597   Median :0.7516   Median :0.7728   Median :0.7836  
##  Mean   :0.7585   Mean   :0.7546   Mean   :0.7745   Mean   :0.7798  
##  3rd Qu.:0.7655   3rd Qu.:0.7649   3rd Qu.:0.7827   3rd Qu.:0.7921  
##  Max.   :0.7697   Max.   :0.7846   Max.   :0.8247   Max.   :0.8055  
##     KNN k=5          KNN k=6          KNN k=7          KNN k=8      
##  Min.   :0.7649   Min.   :0.7862   Min.   :0.7829   Min.   :0.7939  
##  1st Qu.:0.7702   1st Qu.:0.7905   1st Qu.:0.7928   1st Qu.:0.8006  
##  Median :0.7763   Median :0.7965   Median :0.7947   Median :0.8048  
##  Mean   :0.7819   Mean   :0.7981   Mean   :0.7960   Mean   :0.8046  
##  3rd Qu.:0.7909   3rd Qu.:0.8051   3rd Qu.:0.7987   3rd Qu.:0.8094  
##  Max.   :0.8119   Max.   :0.8137   Max.   :0.8115   Max.   :0.8146  
##     KNN k=9          KNN k=10         KNN k=11         KNN k=12     
##  Min.   :0.7971   Min.   :0.7843   Min.   :0.7969   Min.   :0.8046  
##  1st Qu.:0.8067   1st Qu.:0.8129   1st Qu.:0.8081   1st Qu.:0.8071  
##  Median :0.8195   Median :0.8186   Median :0.8177   Median :0.8114  
##  Mean   :0.8160   Mean   :0.8133   Mean   :0.8216   Mean   :0.8136  
##  3rd Qu.:0.8249   3rd Qu.:0.8218   3rd Qu.:0.8377   3rd Qu.:0.8203  
##  Max.   :0.8305   Max.   :0.8294   Max.   :0.8475   Max.   :0.8254  
##     KNN k=13         KNN k=14         KNN k=15         KNN k=16     
##  Min.   :0.7895   Min.   :0.8101   Min.   :0.8003   Min.   :0.8041  
##  1st Qu.:0.8169   1st Qu.:0.8153   1st Qu.:0.8132   1st Qu.:0.8221  
##  Median :0.8310   Median :0.8177   Median :0.8221   Median :0.8257  
##  Mean   :0.8257   Mean   :0.8223   Mean   :0.8215   Mean   :0.8266  
##  3rd Qu.:0.8354   3rd Qu.:0.8270   3rd Qu.:0.8270   3rd Qu.:0.8353  
##  Max.   :0.8460   Max.   :0.8448   Max.   :0.8446   Max.   :0.8470  
##     KNN k=17         KNN k=18         KNN k=19         KNN k=20     
##  Min.   :0.8067   Min.   :0.7942   Min.   :0.7986   Min.   :0.8077  
##  1st Qu.:0.8122   1st Qu.:0.8072   1st Qu.:0.8174   1st Qu.:0.8209  
##  Median :0.8216   Median :0.8128   Median :0.8280   Median :0.8259  
##  Mean   :0.8198   Mean   :0.8153   Mean   :0.8271   Mean   :0.8243  
##  3rd Qu.:0.8253   3rd Qu.:0.8199   3rd Qu.:0.8351   3rd Qu.:0.8286  
##  Max.   :0.8321   Max.   :0.8522   Max.   :0.8493   Max.   :0.8353  
##     KNN k=21         KNN k=22         KNN k=23         KNN k=24     
##  Min.   :0.8108   Min.   :0.8032   Min.   :0.7869   Min.   :0.8188  
##  1st Qu.:0.8200   1st Qu.:0.8158   1st Qu.:0.8073   1st Qu.:0.8241  
##  Median :0.8251   Median :0.8283   Median :0.8148   Median :0.8312  
##  Mean   :0.8273   Mean   :0.8255   Mean   :0.8172   Mean   :0.8306  
##  3rd Qu.:0.8331   3rd Qu.:0.8345   3rd Qu.:0.8308   3rd Qu.:0.8369  
##  Max.   :0.8576   Max.   :0.8404   Max.   :0.8456   Max.   :0.8443  
##     KNN k=25         KNN k=26         KNN k=27         KNN k=28     
##  Min.   :0.8084   Min.   :0.8074   Min.   :0.7895   Min.   :0.8183  
##  1st Qu.:0.8166   1st Qu.:0.8173   1st Qu.:0.8200   1st Qu.:0.8213  
##  Median :0.8209   Median :0.8248   Median :0.8243   Median :0.8248  
##  Mean   :0.8208   Mean   :0.8238   Mean   :0.8212   Mean   :0.8281  
##  3rd Qu.:0.8260   3rd Qu.:0.8277   3rd Qu.:0.8276   3rd Qu.:0.8358  
##  Max.   :0.8314   Max.   :0.8393   Max.   :0.8387   Max.   :0.8419  
##     KNN k=29         KNN k=30         KNN k=31         KNN k=32     
##  Min.   :0.8081   Min.   :0.8082   Min.   :0.7988   Min.   :0.8029  
##  1st Qu.:0.8182   1st Qu.:0.8184   1st Qu.:0.8209   1st Qu.:0.8222  
##  Median :0.8228   Median :0.8221   Median :0.8346   Median :0.8289  
##  Mean   :0.8255   Mean   :0.8257   Mean   :0.8317   Mean   :0.8271  
##  3rd Qu.:0.8320   3rd Qu.:0.8315   3rd Qu.:0.8425   3rd Qu.:0.8312  
##  Max.   :0.8466   Max.   :0.8510   Max.   :0.8533   Max.   :0.8478  
##     KNN k=33         KNN k=34         KNN k=35         KNN k=36     
##  Min.   :0.8095   Min.   :0.8157   Min.   :0.8045   Min.   :0.8131  
##  1st Qu.:0.8182   1st Qu.:0.8278   1st Qu.:0.8152   1st Qu.:0.8202  
##  Median :0.8202   Median :0.8301   Median :0.8312   Median :0.8239  
##  Mean   :0.8238   Mean   :0.8326   Mean   :0.8264   Mean   :0.8270  
##  3rd Qu.:0.8324   3rd Qu.:0.8384   3rd Qu.:0.8372   3rd Qu.:0.8281  
##  Max.   :0.8434   Max.   :0.8540   Max.   :0.8410   Max.   :0.8553  
##     KNN k=37         KNN k=38         KNN k=39         KNN k=40     
##  Min.   :0.8061   Min.   :0.7951   Min.   :0.8172   Min.   :0.7997  
##  1st Qu.:0.8177   1st Qu.:0.8151   1st Qu.:0.8253   1st Qu.:0.8198  
##  Median :0.8244   Median :0.8288   Median :0.8331   Median :0.8289  
##  Mean   :0.8261   Mean   :0.8269   Mean   :0.8318   Mean   :0.8264  
##  3rd Qu.:0.8343   3rd Qu.:0.8415   3rd Qu.:0.8359   3rd Qu.:0.8340  
##  Max.   :0.8543   Max.   :0.8484   Max.   :0.8484   Max.   :0.8475  
##     KNN k=41         KNN k=42         KNN k=43         KNN k=44     
##  Min.   :0.8200   Min.   :0.8026   Min.   :0.8017   Min.   :0.8194  
##  1st Qu.:0.8327   1st Qu.:0.8161   1st Qu.:0.8159   1st Qu.:0.8281  
##  Median :0.8418   Median :0.8179   Median :0.8248   Median :0.8349  
##  Mean   :0.8403   Mean   :0.8249   Mean   :0.8248   Mean   :0.8352  
##  3rd Qu.:0.8471   3rd Qu.:0.8372   3rd Qu.:0.8341   3rd Qu.:0.8394  
##  Max.   :0.8600   Max.   :0.8459   Max.   :0.8478   Max.   :0.8599  
##     KNN k=45         KNN k=46         KNN k=47         KNN k=48     
##  Min.   :0.8195   Min.   :0.8095   Min.   :0.8244   Min.   :0.8088  
##  1st Qu.:0.8261   1st Qu.:0.8189   1st Qu.:0.8294   1st Qu.:0.8168  
##  Median :0.8339   Median :0.8207   Median :0.8348   Median :0.8286  
##  Mean   :0.8331   Mean   :0.8207   Mean   :0.8360   Mean   :0.8274  
##  3rd Qu.:0.8408   3rd Qu.:0.8248   3rd Qu.:0.8399   3rd Qu.:0.8350  
##  Max.   :0.8437   Max.   :0.8304   Max.   :0.8513   Max.   :0.8516  
##     KNN k=49         KNN k=50     
##  Min.   :0.8078   Min.   :0.8131  
##  1st Qu.:0.8211   1st Qu.:0.8182  
##  Median :0.8293   Median :0.8297  
##  Mean   :0.8293   Mean   :0.8290  
##  3rd Qu.:0.8405   3rd Qu.:0.8360  
##  Max.   :0.8447   Max.   :0.8495

We can plot the mean accuracy and the range of each k to see how accuracy varies as we change the number of nearest neighbours.

AccuracyKNNdf <- data.frame(K=1:50, Mean=apply(AccuracyKNN,2,mean), Minimum=apply(AccuracyKNN,2,min), Maximum=apply(AccuracyKNN,2,max))#put accuracy statistics in a dataframe 
ggplot(data = AccuracyKNNdf, aes(x=K, y=Mean, ymin=Minimum, ymax=Maximum)) + geom_pointrange() + geom_smooth(method='loess') + ggtitle("KNN accuracy for K = 1 to 50") + ylab("Accuracy") #plot the mean and max and minimum accuracy for each k with a line of best fit

At about k = 20-25 the accuracy seems to plateau, suggesting little is gained by increasing k beyond this point. The accuracy is similar to that of previous methods.

Ensemble learning

We can combine the methods above by taking the modal prediction of the classifiers with equal weights, in the hope of improving accuracy and reducing the range of accuracies of the model on different data sets. As the previous methods had similar accuracies, this may allow us to obtain a better overall accuracy: even though their accuracies are similar, they may not be misclassifying the same data points.

Accuracyen <- matrix(NA, nrow = 10, ncol = 1)#allocate memory to store accuracies
colnames(Accuracyen) <- "Ensemble" #name columns

#find mode function
getmodalvalue <- function(x) {
   u <- unique(x)#find unique values of x
   u[which.max(tabulate(match(x, u)))]#find modal value of x
}
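For example, applied to one hypothetical row of class predictions the function returns the majority vote (the function is repeated here so the example is self-contained):

```r
getmodalvalue <- function(x) {
   u <- unique(x)                       # unique values of x
   u[which.max(tabulate(match(x, u)))]  # most frequent value
}
getmodalvalue(c("high", "low", "high", "high", "low")) # returns "high"
```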


for (i in 1:10){#repeat 10 times for bootstrapping
  bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
  outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
  
  #create models
  models <- list(glm(UseLevel ~ ., family=binomial(link='logit'), data=druguse_num_s[bootsample,]),
                 naiveBayes(UseLevel ~ ., data=druguse_num_s[bootsample,]),
                 lda(UseLevel ~ ., data=druguse_num_s[bootsample,]),
                 mda(UseLevel ~ ., data=druguse_num_s[bootsample,], subclasses = 3),
                 svm(UseLevel~.,data = druguse_num_s[bootsample,], kernel="linear"))
  predictions <- lapply(models,predict,newdata=druguse_num_s[outofbag,])#create predictions
  
  predictions[[1]] <- ifelse(predictions[[1]]>0.5, 'high','low')#set cut off for glm
  
  predictions[[3]] <- predictions[[3]]$class#extract class from lda predictions
  predictions[[6]] <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=25)#create knn classifier and predictions
  nnmodel <- nnet(UseLevel~. , data=druguse_num_s[bootsample,], size = 5, maxit=1000, trace=F)#create neural network with no trace
  predictions[[7]] <- predict(nnmodel, druguse_num_s[outofbag,], type="class") #add neural network predictions 
  predictions <- apply(data.frame(predictions),2,as.character)#set all predictions elements to character type using apply
  
  finalpred <- apply(predictions,1,getmodalvalue)#find mode of each row of the predictions from each model
  
  Accuracyen[i] <- sum(finalpred==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy 
}
summary(Accuracyen)#summarise accuracy of equally weighted ensemble
##     Ensemble     
##  Min.   :0.8166  
##  1st Qu.:0.8262  
##  Median :0.8333  
##  Mean   :0.8331  
##  3rd Qu.:0.8377  
##  Max.   :0.8542

We then use summary to show the performance of our classifier on a new data set. We see the performance is similar to some of the techniques on their own. If we changed the weights given to each model's prediction we could improve this. We could try different weight values ourselves, but this would take a long time; instead we can use a second neural network to learn the weights, as it can find an appropriate weighting of each model's predictions faster than a for loop searching over candidate weights.

classifiers <- sprintf("%s",1:7)#create colnames for 2nd nnet input
Accuracyennn <- matrix(NA, nrow=10, ncol=4)#create matrix to store accuracies
colnames(Accuracyennn) <- sprintf("Nodes %s", (1:4)*5)#create names for accuracy matrix columns
for (r in 1:4){#try different nn sizes 5,10,15,20
  for (i in 1:10){#repeat 10 times
    bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
    outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
    
    #create models
    models <- list(glm(UseLevel ~ ., family=binomial(link='logit'), data=druguse_num_s[bootsample,]),
                   naiveBayes(UseLevel ~ ., data=druguse_num_s[bootsample,]),
                   lda(UseLevel ~ ., data=druguse_num_s[bootsample,]),
                   mda(UseLevel ~ ., data=druguse_num_s[bootsample,], subclasses = 3),
                   svm(UseLevel~.,data = druguse_num_s[bootsample,], kernel="linear"))
    #predict using models
    predictions <- lapply(models,predict,newdata=druguse_num_s[outofbag,])
    
    predictions[[1]] <- ifelse(predictions[[1]]>0.5, 'high','low')#set cut off for glm
    
    predictions[[3]] <- predictions[[3]]$class#extract class for lda
    predictions[[6]] <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=25)#create knn classifier and predictions
    nnmodel1 <- nnet(UseLevel~. , data=druguse_num_s[bootsample,], size = 5, maxit=1000, trace=F)#make nnet with no trace
    predictions[[7]] <- predict(nnmodel1, druguse_num_s[outofbag,], type="class")  #predict using nnet
    predictions <- data.frame(apply(data.frame(predictions),2,as.character)) #make each element of character type using apply
    
    colnames(predictions) <- classifiers #assign names to columns for next nnet
    
    predictions$UseLevel <- druguse$UseLevel[outofbag] #add true uselevels as column to predictions data frame 
    
    nnmodel2 <- nnet(UseLevel~.,data=predictions, size=(r*5), maxit=1000, trace=F) #make 2nd neural network using predictions from all models with no trace
    
    #make new samples
    bootsample <- sample(nrow(druguse_num_s_unsupervised),nrow(druguse_num_s_unsupervised), replace = T)#take sample indices
    outofbag <- setdiff(1:nrow(druguse_num_s_unsupervised), bootsample)#create validation data indices
    
    #create predictions on new indices using the models made previously
    predictions <- lapply(models,predict,newdata=druguse_num_s[outofbag,])
    
    predictions[[1]] <- ifelse(predictions[[1]]>0.5, 'high','low')
    
    predictions[[3]] <- predictions[[3]]$class
    predictions[[6]] <- knn(druguse_num_s_unsupervised[bootsample,],druguse_num_s_unsupervised[outofbag,],druguse_num_s[bootsample,]$UseLevel,k=25)#create knn classifier and predictions
 
    predictions[[7]] <- predict(nnmodel1, druguse_num_s[outofbag,], type="class") #make predictions using neural net 
    predictions <- data.frame(apply(data.frame(predictions),2,as.character)) #make all elements of character type
    
    colnames(predictions) <- classifiers #assign same names to columns
    
    predictions$UseLevel <- druguse$UseLevel[outofbag] #add true use level
    
    finalpred <- predict(nnmodel2,predictions,type='class') #make final predictions using 2nd nnet and new predictions
    
    Accuracyennn[i,r] <- sum(finalpred==druguse_num_s$UseLevel[outofbag])/length(outofbag)#store accuracy 
  }
}
summary(Accuracyennn)#summarise accuracies
##     Nodes 5          Nodes 10         Nodes 15         Nodes 20     
##  Min.   :0.8182   Min.   :0.8291   Min.   :0.8104   Min.   :0.8209  
##  1st Qu.:0.8330   1st Qu.:0.8403   1st Qu.:0.8376   1st Qu.:0.8363  
##  Median :0.8389   Median :0.8468   Median :0.8433   Median :0.8408  
##  Mean   :0.8359   Mean   :0.8475   Mean   :0.8403   Mean   :0.8416  
##  3rd Qu.:0.8394   3rd Qu.:0.8570   3rd Qu.:0.8492   3rd Qu.:0.8438  
##  Max.   :0.8450   Max.   :0.8702   Max.   :0.8668   Max.   :0.8681

Summarising the accuracies shows how the classifier performs. This technique with 10 nodes seems to perform best; as the neural network weights each classifier's predictions according to its relative performance, we see a slight improvement over the methods on their own.

Question 3.2

To estimate how each model performs on a new data set I have used bootstrapping, where the training data set is sampled with replacement from the whole data set and the out-of-bag observations (those never drawn into the training sample) are used as a validation data set. To compare the classifiers and the ensemble learning techniques we can plot a boxplot. For classification techniques where we tested multiple parameters we only plot the best-performing setting.
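One bootstrap/out-of-bag split can be sketched on a toy index set; on average roughly a third of the rows end up out of bag:

```r
# One bootstrap resample of 10 row indices and its out-of-bag complement
set.seed(1)
n <- 10
bootsample <- sample(n, n, replace = TRUE) # training indices, with repeats
outofbag <- setdiff(1:n, bootsample)       # rows never drawn: validation set
length(unique(bootsample)) + length(outofbag) == n # always TRUE
```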

comparison <- data.frame("GLM"=Accuracy, "Naive Bayes"=Accuracybayes, "LDA"=Accuracylda,"MDA"=Accuracymda[,3],"SVM"=Accuracysvms[,1],"NN"=Accuracynn[,1], "KNN"=AccuracyKNN[,25],"Ensemble"=Accuracyen,"Ensemble NN"=Accuracyennn[,2])#create data frame of accuracies of the models with their best parameters

melted_comparison <- melt(comparison)#reshape to make it easier to plot
## No id variables; using all as measure variables
ggplot(melted_comparison, aes(x=variable, y=value, fill=variable)) + geom_boxplot(colour=rainbow(9)) + coord_flip()  + theme(legend.position="none") + xlab("Classifier") + ylab("Accuracy") + ggtitle("Comparison of classification techniques for use level") #create horizontal boxplot of all techniques accuracies 

The ensemble neural network technique performs best: some classifiers perform better in certain cases, and the neural network learns the best combination of the classifiers' predictions to maximise accuracy, which it has done successfully, although the performance of the logistic regression classifier is similar. A predicted average accuracy of around 85% on a new data set is high.

Question 4

Question 4.1

Add a new column of factors for whether the individual has ever used heroin.

#create used heroin column and assign yes or no
druguse$usedheroin[druguse$heroin==0] <- 'no'
druguse$usedheroin[druguse$heroin!=0] <- 'yes'
druguse$usedheroin <- as.factor(druguse$usedheroin) #change from character to factor type 
str(druguse) #check column has been added to the dataframe and is of correct type
## 'data.frame':    1885 obs. of  34 variables:
##  $ agegroup         : Factor w/ 6 levels "18-24","25-34",..: 1 1 2 2 3 1 1 2 3 1 ...
##  $ gender           : Factor w/ 2 levels "female","male": 1 2 1 2 1 1 2 1 1 2 ...
##  $ education        : num  0.455 -0.611 1.164 1.164 0.455 ...
##  $ country          : Factor w/ 7 levels "Australia","Canada",..: 6 7 6 6 6 7 7 6 6 6 ...
##  $ ethnicity        : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ neuroticism      : num  -0.1488 -0.921 1.0212 -1.5508 -0.0519 ...
##  $ extraversion     : num  -0.948 1.741 0.962 1.114 0.962 ...
##  $ opentoexperience : num  -1.4242 0.5833 -0.9763 -0.0193 -1.119 ...
##  $ agreeableness    : num  -0.4532 -1.3429 -0.6063 -0.0173 0.9416 ...
##  $ conscientiousness: num  0.26 1.134 -0.143 1.462 1.631 ...
##  $ impulsiveness    : num  0.53 0.53 -1.38 0.193 -1.38 ...
##  $ sensation        : num  -0.8464 1.2247 -1.1808 0.0799 -2.0785 ...
##  $ caffeine         : num  4 6 6 6 5 6 5 6 6 5 ...
##  $ chocolate        : num  5 5 6 5 6 6 4 6 6 6 ...
##  $ nicotine         : num  2 6 5 4 0 6 4 3 0 4 ...
##  $ alcohol          : num  4 4 5 5 3 5 3 3 5 5 ...
##  $ amphetamine      : num  0 3 0 0 0 0 2 0 0 4 ...
##  $ amylnitrite      : num  0 0 2 0 0 0 3 0 0 2 ...
##  $ benzodiaz        : num  0 2 0 0 0 0 3 0 0 3 ...
##  $ cannabis         : num  2 5 3 2 0 6 5 0 1 4 ...
##  $ cocaine          : num  0 3 2 0 0 0 2 0 0 2 ...
##  $ crack            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ecstasy          : num  0 0 0 0 0 0 3 0 0 4 ...
##  $ heroin           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ketamine         : num  0 0 0 0 0 0 0 0 0 4 ...
##  $ legalhighs       : num  0 3 2 4 0 5 3 0 0 3 ...
##  $ LSD              : num  0 0 0 0 0 0 3 0 0 2 ...
##  $ methadone        : num  0 0 0 0 0 2 4 0 0 0 ...
##  $ mushrooms        : num  0 3 0 0 0 3 3 0 0 2 ...
##  $ volatiles        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ any              : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 1 2 2 2 2 2 ...
##  $ severity         : num  4 25 14 10 0 22 35 3 1 34 ...
##  $ UseLevel         : Factor w/ 2 levels "low","high": 1 2 2 1 1 2 2 1 1 2 ...
##  $ usedheroin       : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Create the training and test data sets including the new used heroin column and other relevant columns.

druguseheroin_predictors <- druguse[c(1:23,25:30,34)]#create dataframe only of predictors removing unwanted columns
druguseheroin_train <- druguseheroin_predictors[1:1400,] #create training data by removing rows of test data
druguseheroin_test <- druguseheroin_predictors[1401:1885,]#create test data from remaining rows

Create a random forest, a confusion table and calculate the accuracy.

rf.heroin <- randomForest(usedheroin ~ ., data=druguseheroin_train) #create a random forest
prf=predict(rf.heroin, newdata=druguseheroin_test)#Create predictions 

tablerfh <- table(prf,druguseheroin_test$usedheroin) #create confusion table

#calculate correct and incorrect classifications of each type, taking 'yes' as positive
TN=tablerfh[1]#predicted no, actually no
FP=tablerfh[2]#predicted yes, actually no
FN=tablerfh[3]#predicted no, actually yes
TP=tablerfh[4]#predicted yes, actually yes

Accuracyrfh <- (TP+TN)/(TP+TN+FN+FP)#calculate accuracy
tablerfh #output table
##      
## prf    no yes
##   no  399  35
##   yes   6  45
Accuracyrfh #output accuracy
## [1] 0.9154639

The random forest performs well; however, there aren't many people in the data set who have used heroin, so a classifier can score well simply by being heavily biased towards predicting no. This is why in the confusion table we have more false negatives than false positives, suggesting the classifier may be doing this to an extent. The numbers of yes and no are shown below to illustrate this.

x <- druguse %>% group_by(usedheroin) %>% tally() #tally number of people who have and havent used heroin
x #print tibble
## # A tibble: 2 x 2
##   usedheroin     n
##   <fct>      <int>
## 1 no          1605
## 2 yes          280
x[1,2]/nrow(druguse) #show accuracy if just predict no all the time
##           n
## 1 0.8514589

As shown above, 85% of the data set is no, so a classifier that always predicts no can expect an accuracy of around 85%. This means the random forest only performs a few percentage points better than this baseline.

Question 4.2

Predict severity based on the first 14 predictors and cannabis use, but no other illegal substances, and without the non-significant predictors identified in the data analysis (alcohol, extraversion and country). We use scaled data, as this will likely improve the performance of our regression models, and create a data set of the predictors and severity. As this is regression we will compare the techniques by their mean squared error; we can only use regression techniques because severity isn't a class. We can evaluate each technique and, if it performs well, use it in an ensemble as in question 3 to see if we can lower the MSE.
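Mean squared error, the comparison metric used below, on a toy prediction vector:

```r
# MSE: average squared difference between predictions and true values
truth <- c(1, 2, 3)
pred  <- c(1.5, 2, 2)
mse <- mean((pred - truth)^2) # (0.25 + 0 + 1) / 3
mse
```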

druguse_pred_sev <- druguse_num_s_unsupervised #without variables with high p value
druguse_pred_sev$cannabis <- scale(druguse$cannabis) #add scaled cannabis usage
druguse_pred_sev$severity <- scale(druguse$severity) #add scaled severity

Lasso/Ridge regression

Lasso and ridge regression are penalised linear regression techniques: they find a set of coefficients for the predictors subject to a penalty on large coefficient values. As shrinkage methods they are less susceptible to data sets with extreme values, which could make them work well with our data.
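To illustrate the shrinkage, ridge regression has the closed-form solution (X'X + lambda*I)^-1 X'y, so a larger penalty pulls the coefficients towards zero. A small sketch on simulated data (the data and seed here are made up for illustration):

```r
set.seed(3)
X <- scale(matrix(rnorm(40), 20, 2))      # 20 observations, 2 standardised predictors
y <- X %*% c(2, -1) + rnorm(20)           # true coefficients 2 and -1 plus noise
ridge_beta <- function(lambda)            # closed-form ridge solution
  solve(crossprod(X) + lambda * diag(2), crossprod(X, y))
sum(ridge_beta(10)^2) < sum(ridge_beta(0)^2)  # TRUE: the penalty shrinks the coefficients
```

At lambda = 0 this reduces to ordinary least squares; lasso has no closed form but shrinks in the same spirit, additionally setting some coefficients exactly to zero.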

mselasso <- matrix(NA,nrow = 10,ncol = 21)#create array to store mse 
colnames(mselasso) <- sprintf("Lasso/Ridge alpha=%s",(1:21-1)/20)#name columns

for (r in 1:21){#alpha = 0,0.05,0.1....1 with alpha=0 being pure ridge and alpha=1 pure lasso 
  for (i in 1:10){#repeat 10 times
    bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
    outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#out-of-bag validation indices
    
    #create input for model
    x <- model.matrix(severity~., druguse_pred_sev[bootsample,])#train x
    y <- druguse_pred_sev$severity[bootsample,] #train y
    
    
    modellasso <- cv.glmnet(x,y, alpha = (r-1)/20)#fit lasso/ridge with the current alpha and cross-validate lambda
    plas <- predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data, building the design matrix the same way as for training
    
    mselasso[i,r] <- mean((plas - druguse_pred_sev$severity[outofbag])^2)#calculate mean squared error and store
  }
}
summary(mselasso)#summarise mse
##  Lasso/Ridge alpha=0 Lasso/Ridge alpha=0.05 Lasso/Ridge alpha=0.1
##  Min.   :0.2213      Min.   :0.2029         Min.   :0.1852       
##  1st Qu.:0.2560      1st Qu.:0.2247         1st Qu.:0.1998       
##  Median :0.2678      Median :0.2278         Median :0.2280       
##  Mean   :0.2658      Mean   :0.2300         Mean   :0.2255       
##  3rd Qu.:0.2823      3rd Qu.:0.2377         3rd Qu.:0.2507       
##  Max.   :0.2901      Max.   :0.2538         Max.   :0.2595       
##  Lasso/Ridge alpha=0.15 Lasso/Ridge alpha=0.2 Lasso/Ridge alpha=0.25
##  Min.   :0.1983         Min.   :0.1787        Min.   :0.1925        
##  1st Qu.:0.2106         1st Qu.:0.2162        1st Qu.:0.2097        
##  Median :0.2203         Median :0.2255        Median :0.2228        
##  Mean   :0.2245         Mean   :0.2288        Mean   :0.2215        
##  3rd Qu.:0.2389         3rd Qu.:0.2419        3rd Qu.:0.2310        
##  Max.   :0.2529         Max.   :0.2753        Max.   :0.2463        
##  Lasso/Ridge alpha=0.3 Lasso/Ridge alpha=0.35 Lasso/Ridge alpha=0.4
##  Min.   :0.1905        Min.   :0.1722         Min.   :0.1937       
##  1st Qu.:0.2249        1st Qu.:0.1992         1st Qu.:0.2125       
##  Median :0.2368        Median :0.2143         Median :0.2350       
##  Mean   :0.2389        Mean   :0.2206         Mean   :0.2384       
##  3rd Qu.:0.2547        3rd Qu.:0.2389         3rd Qu.:0.2583       
##  Max.   :0.3005        Max.   :0.2875         Max.   :0.2914       
##  Lasso/Ridge alpha=0.45 Lasso/Ridge alpha=0.5 Lasso/Ridge alpha=0.55
##  Min.   :0.1898         Min.   :0.1838        Min.   :0.1712        
##  1st Qu.:0.2107         1st Qu.:0.1991        1st Qu.:0.2081        
##  Median :0.2309         Median :0.2091        Median :0.2202        
##  Mean   :0.2249         Mean   :0.2123        Mean   :0.2258        
##  3rd Qu.:0.2427         3rd Qu.:0.2222        3rd Qu.:0.2497        
##  Max.   :0.2489         Max.   :0.2498        Max.   :0.2738        
##  Lasso/Ridge alpha=0.6 Lasso/Ridge alpha=0.65 Lasso/Ridge alpha=0.7
##  Min.   :0.2120        Min.   :0.1804         Min.   :0.1778       
##  1st Qu.:0.2325        1st Qu.:0.2012         1st Qu.:0.2074       
##  Median :0.2384        Median :0.2064         Median :0.2254       
##  Mean   :0.2400        Mean   :0.2167         Mean   :0.2192       
##  3rd Qu.:0.2522        3rd Qu.:0.2318         3rd Qu.:0.2306       
##  Max.   :0.2642        Max.   :0.2614         Max.   :0.2451       
##  Lasso/Ridge alpha=0.75 Lasso/Ridge alpha=0.8 Lasso/Ridge alpha=0.85
##  Min.   :0.1945         Min.   :0.1646        Min.   :0.1595        
##  1st Qu.:0.2193         1st Qu.:0.2106        1st Qu.:0.2084        
##  Median :0.2437         Median :0.2249        Median :0.2167        
##  Mean   :0.2354         Mean   :0.2232        Mean   :0.2165        
##  3rd Qu.:0.2516         3rd Qu.:0.2447        3rd Qu.:0.2394        
##  Max.   :0.2659         Max.   :0.2507        Max.   :0.2693        
##  Lasso/Ridge alpha=0.9 Lasso/Ridge alpha=0.95 Lasso/Ridge alpha=1
##  Min.   :0.1640        Min.   :0.1761         Min.   :0.1823     
##  1st Qu.:0.1923        1st Qu.:0.2095         1st Qu.:0.2123     
##  Median :0.2229        Median :0.2129         Median :0.2273     
##  Mean   :0.2150        Mean   :0.2110         Mean   :0.2251     
##  3rd Qu.:0.2384        3rd Qu.:0.2200         3rd Qu.:0.2410     
##  Max.   :0.2495        Max.   :0.2301         Max.   :0.2572

We can plot alpha against the average mean squared error to see the optimal alpha.

#create dataframe from data for ggplot
lasso <- data.frame("alpha"=(1:21 - 1)/20, "avg mse"=apply(mselasso,2,mean))

#plot avg mse and alpha in ggplot 
ggplot(data = lasso, aes(x = alpha, y = avg.mse)) + geom_smooth() + geom_point() + ggtitle("Average MSE for alpha")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The plot suggests better performance as alpha approaches 1 (i.e. lasso rather than ridge); however, the difference is small.

Generalised Linear Model

We can apply a generalised linear model again, this time with a Gaussian family, as this is a regression problem and in our data analysis the data appeared to be Gaussian distributed. We also apply a basis transformation to the model to include products of pairs of predictors, which may improve accuracy at the risk of overfitting.
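As a sanity check, a GLM with Gaussian family and identity link is mathematically identical to ordinary least squares; a quick sketch on simulated data (variable names here are illustrative):

```r
set.seed(42)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)
# the .^2 basis expands to x1 + x2 + x1:x2, as used in the chunk below
fit_glm <- glm(y ~ .^2, data = d, family = gaussian(link = "identity"))
fit_lm  <- lm(y ~ .^2, data = d)        # same model fitted by least squares
all.equal(coef(fit_glm), coef(fit_lm))  # TRUE: identical coefficients
```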

mseglm <- matrix(NA, nrow = 10, ncol = 1)#create place to store mse
colnames(mseglm) <- "GLM"#name column
for (i in 1:10){#repeat 10 times
  bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
  outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#out-of-bag validation indices
  
  
  modelglm <- glm(severity~.^2, data = druguse_pred_sev[bootsample,], family = gaussian(link = "identity")) #fit the glm on the bootstrap sample using the predictors and their pairwise products
  
  pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
  
  mseglm[i] <- mean((pglm - druguse_pred_sev$severity[outofbag,])^2) #calculate mean squared error
}
summary(mseglm)#show summary of mse
##       GLM        
##  Min.   :0.2711  
##  1st Qu.:0.2763  
##  Median :0.2888  
##  Mean   :0.2892  
##  3rd Qu.:0.2975  
##  Max.   :0.3202

The GLM performs worse than ridge regression. This could be due to extreme values, which affect an unpenalised method such as a GLM more than they affect lasso/ridge regression.

Multivariate Adaptive Regression Splines

MARS fits regression splines that split each predictor's range into sections, fitting a linear model within each. It does this adaptively over multiple variables: the forward step repeatedly adds the pair of basis terms that most reduces the residual sum of squares, testing every candidate, and the backward step then prunes the least effective terms until the best model is found. This can further remove unhelpful predictors, and because it uses splines it produces a piecewise-linear model, which may be a better regressor than a purely linear one. We create the model below and summarise the accuracies.
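The building blocks MARS uses are paired hinge functions max(0, x - c) around each knot c; a brief sketch of how two hinges give a piecewise-linear fit (the knot at 0 and the coefficients are arbitrary examples):

```r
hinge <- function(x, c) pmax(0, x - c)   # the MARS basis function max(0, x - c)
x <- c(-2, -1, 0, 1, 2)
# a knot at 0 splits the line into two pieces with separate slopes:
fit_piecewise <- function(x) 1 + 0.5 * hinge(x, 0) - 2 * hinge(-x, 0)
fit_piecewise(x)  # slope 2 to the left of the knot, slope 0.5 to the right
```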

msemars <- matrix(NA,nrow = 10, ncol = 1)#create place to store mse
colnames(msemars) <- "MARS"#name column
for (i in 1:10){#repeat 10 times
  bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
  outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#out-of-bag validation indices
  
  x <- druguse_pred_sev[bootsample, -15]
  y <- druguse_pred_sev$severity[bootsample] 
  
  modelmars <- mars(x, y, forward.step = T, prune = T, degree = 1) #create linear mars model
  pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
  
  msemars[i] <- mean((pmars - druguse_pred_sev$severity[outofbag])^2) #calculate mean squared error
}
summary(msemars)#show summary of mse
##       MARS       
##  Min.   :0.2844  
##  1st Qu.:0.3110  
##  Median :0.3333  
##  Mean   :0.3254  
##  3rd Qu.:0.3372  
##  Max.   :0.3522

MARS offers higher MSE than the previous techniques.

We can plot the cuts of each predictor from the model as below. Straight lines indicate predictors the model deems not significant through its forward step and backward pruning.

par(mfrow = c(4, 4))#create 4x4 grid in plotting device
for (i in 1:14)#plot all predictors
  {
    xp <- matrix(sapply(druguse_pred_sev[1:14], mean), nrow(druguse_pred_sev), ncol(druguse_pred_sev) - 1, byrow = TRUE)#create matrix of means
    xr <- sapply(druguse_pred_sev, range)#find range of each predictor
    xp[, i] <- seq(xr[1, i], xr[2, i], len=nrow(druguse_pred_sev))#create sequence of the length of the rows of df across the range of each predictor
    xf <- predict(modelmars, xp)#predict on the sequence
    plot(xp[, i], xf, xlab = names(druguse_pred_sev)[i], ylab = "", type = "l");#plot predictions against range of each predictor
  }

Neural Net

We can use a neural network, again testing different sizes, to predict severity. Neural nets learn a function relating the predictors to the value being predicted, so they could be a good technique for regression. We test several hidden-layer sizes to find the best one.
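Under the hood, a single-hidden-layer network with a linear output (what nnet fits with linout = TRUE) is just a weighted sum of sigmoid units; a minimal forward pass with made-up weights:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
x  <- c(0.5, -1)                  # two inputs
V  <- diag(2)                     # input-to-hidden weights (2 hidden units, illustrative)
a  <- c(0, 0)                     # hidden-unit biases
w  <- c(2, -1); b0 <- 0.1         # hidden-to-output weights and output bias
h  <- sigmoid(V %*% x + a)        # hidden activations
as.numeric(b0 + sum(w * h))       # linear output unit, as with linout = TRUE
```

Fitting adjusts V, a, w and b0 to minimise the squared error; more hidden units make the function more flexible, at the cost of overfitting.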

msenn <- matrix(NA,nrow=10,ncol=4) #allocate matrix to store mse
colnames(msenn) <- sprintf("NN nodes=%s", (1:4)*5)
for (r in 1:4){#for loop for testing model sizes 5,10,15,20
  for (i in 1:10){#for loop for cross validating each model
    bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
    outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#out-of-bag validation indices

    nnmodel <- nnet(severity~. , data=druguse_pred_sev[bootsample,], size = (r*5), maxit=1000, linout=T, trace=F) #create nnet with linear output
    #predict on validation data
    nnpredn = predict(nnmodel,druguse_pred_sev[outofbag,])

    #calculate mse and store it
    msenn[i,r] <- mean((nnpredn - druguse_pred_sev$severity[outofbag])^2) 
  }
}
summary(msenn)#summarise mse

The summary shows the MSE of the neural nets. Their performance is poor, with a wide range and extreme outliers, suggesting a neural net is not a good model on its own in this case.

Ensemble Learning

We can feed the models (excluding the neural net, given its poor performance) into an ensemble technique by taking the mean of the regressors' predictions, rather than the mode as in the classification case. This will hopefully lower the MSE.

mseen <- matrix(NA, nrow=10, ncol=1)#create matrix to store mse
colnames(mseen) <- "Ensemble"#name column

for (i in 1:10){#repeat 10 times
  bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
  outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#out-of-bag validation indices
  
  #create inputs for ridge and lasso
  x <- model.matrix(severity~., druguse_pred_sev[bootsample,])
  y <- druguse_pred_sev$severity[bootsample] 
  
  modelridge <- cv.glmnet(x,y, alpha = 0)#create ridge model
  ridgepred <-  predict(modelridge, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data using the same design matrix construction as for training
  
  modellasso <- cv.glmnet(x,y, alpha = 1)#create lasso model
  lassopred <-  predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data
  
  modelglm <- glm(severity~.^2, data = druguse_pred_sev[bootsample,], family = gaussian(link = "identity")) #fit glm on the bootstrap sample
  pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
  
  #create input for mars
  x <- druguse_pred_sev[bootsample, -15]
  y <- druguse_pred_sev$severity[bootsample] 

  modelmars <- mars(x, y, forward.step = T, prune = T, degree = 1) #create mars model
  pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
  
  
  predictions <- data.frame("ridge"=ridgepred, "lasso"=lassopred, "glm"=pglm, "mars"=pmars)#create data frame of predictions
  
  finalpred <- apply(predictions,1,mean) #make final predictions by averaging predictions of each model
  
  mseen[i] <- mean((finalpred - druguse_pred_sev$severity[outofbag])^2) #calculate mse
}
summary(mseen)#summarise mse
##     Ensemble     
##  Min.   :0.2110  
##  1st Qu.:0.2145  
##  Median :0.2182  
##  Mean   :0.2189  
##  3rd Qu.:0.2214  
##  Max.   :0.2344

This offers an improvement over the regression techniques individually with a small range.

As before, we can use a neural net instead of averaging, to see whether it can find a better way of combining the regressors than the mean, since different regressors predict better in different cases.
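This idea, often called stacking, makes the base models' predictions the features of a meta-learner that learns how to weight them. A toy sketch with lm() standing in for the neural net meta-learner (data simulated for illustration):

```r
set.seed(7)
truth <- rnorm(30)
base1 <- truth + rnorm(30, sd = 0.5)  # predictions from an imperfect base model
base2 <- truth + rnorm(30, sd = 0.8)  # a second, noisier base model
meta  <- lm(truth ~ base1 + base2)    # meta-learner weights the base predictions
# the fitted combination cannot do worse in-sample than base1 alone:
mean(residuals(meta)^2) <= mean((base1 - truth)^2)
```

In practice the meta-learner should be trained and evaluated on data the base models did not see, or its apparent MSE will be optimistically low.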

mseennn <- matrix(NA, nrow=10, ncol=4)#create matrix to store mse
colnames(mseennn) <- sprintf("Ensemble NN nodes=%s", (1:4)*5)#name columns
for (r in 1:4){
  for (i in 1:10){#repeat 10 times
    bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take bootstrap sample indices
    outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#out-of-bag validation indices
    
    #create inputs for ridge and lasso
    x <- model.matrix(severity~., druguse_pred_sev[bootsample,])
    y <- druguse_pred_sev$severity[bootsample] 
    
    modelridge <- cv.glmnet(x,y, alpha = 0)#create ridge model
    ridgepred <-  predict(modelridge, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data using the same design matrix construction as for training
    
    modellasso <- cv.glmnet(x,y, alpha = 1)#create lasso model
    lassopred <-  predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#predict on validation data
    
    modelglm <- glm(severity~.^2, data = druguse_pred_sev[bootsample,], family = gaussian(link = "identity")) #fit glm on the bootstrap sample
    pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
    
    #create input for mars
    x <- druguse_pred_sev[bootsample, -15]
    y <- druguse_pred_sev$severity[bootsample] 
  
    modelmars <- mars(x, y, forward.step = T, prune = T, degree = 1) #create mars model
    pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
    
    
    predictions <- data.frame("ridge"=ridgepred, "lasso"=lassopred, "glm"=pglm, "mars"=pmars)#create data frame of predictions
    predictions$severity <- druguse_pred_sev$severity[outofbag] #add severity column
    
    nnmodelen <- nnet(severity~., data=predictions, linout=T, maxit=1000, size=(r*5), trace=F)#create nn model using predictors 
    
    #create new samples
    bootsample <- sample(nrow(druguse_pred_sev),nrow(druguse_pred_sev), replace = T)#take new sample indices 
    outofbag <- setdiff(1:nrow(druguse_pred_sev), bootsample)#new validation indices
    
    ridgepred <-  predict(modelridge, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#create predictions
    lassopred <-  predict(modellasso, model.matrix(severity~., druguse_pred_sev[outofbag,]), s = 'lambda.min')#create predictions
    pglm <- predict(modelglm, druguse_pred_sev[outofbag, ])#predict on validation data
    pmars <- predict(modelmars, druguse_pred_sev[outofbag,-15])#predict on validation data
    
    predictions2 <- data.frame("ridge"=ridgepred, "lasso"=lassopred, "glm"=pglm, "mars"=pmars)#create data frame of predictions
    predictions2$severity <- druguse_pred_sev$severity[outofbag] #add severity column
    
    finalpred <- predict(nnmodelen, newdata=predictions2) #use nnet model to make predictions
    
    mseennn[i,r] <- mean((finalpred - druguse_pred_sev$severity[outofbag])^2) #calculate mse
  }
}
summary(mseennn)#summarise mse
##  Ensemble NN nodes=5 Ensemble NN nodes=10 Ensemble NN nodes=15
##  Min.   :0.04419     Min.   :0.03715      Min.   :0.05319     
##  1st Qu.:0.05180     1st Qu.:0.04598      1st Qu.:0.06037     
##  Median :0.05686     Median :0.05569      Median :0.06986     
##  Mean   :0.05930     Mean   :0.05451      Mean   :0.08604     
##  3rd Qu.:0.06745     3rd Qu.:0.06069      3rd Qu.:0.09617     
##  Max.   :0.07809     Max.   :0.07316      Max.   :0.19879     
##  Ensemble NN nodes=20
##  Min.   :0.04467     
##  1st Qu.:0.05533     
##  Median :0.07329     
##  Mean   :0.18525     
##  3rd Qu.:0.11521     
##  Max.   :1.11177

The performance is best with 10 nodes in the hidden layer and is substantially better than any previous technique.

Comparison

We can compare the relative performance of the models by using a boxplot to show their MSE. We have to remove some extreme outliers from the neural nets and only plot pure ridge and lasso.

mse <- cbind(mselasso[,c(1,21)],mseglm,msemars,msenn,mseen,mseennn)#combine mse matrices
mse_melted <- melt(mse)#melt to reshape and turn to data frame
mse_melted_outlier <- mse_melted[mse_melted$value<1,] #remove extreme outliers from neural nets

ggplot(mse_melted_outlier, aes(x=Var2, y=value, fill=Var2)) + geom_boxplot(colour=rainbow(13)) + coord_flip()  + theme(legend.position="none") + xlab("Regression technique") + ylab("MSE") + ggtitle("Comparison of regression techniques for severity") #create boxplot of the MSE of each technique

We see the ensemble techniques offer the lowest MSE, as they did in the classification problem, with the neural network ensemble achieving much lower MSEs than any other method. The 10-node neural net ensemble performs best, though other numbers of nodes perform similarly. Surprisingly, the neural nets perform badly as techniques on their own but best when used to combine the outputs of the other regressors.

Question 5

In question 1 we showed that the p-values of country, extraversion and alcohol are not significant in predicting use level or severity, suggesting they aren't relevant in predicting substance abuse.

druguse %>% group_by(UseLevel, alcohol) %>% tally()#group by uselevel and alcohol in the dataframe and tally cases
## # A tibble: 14 x 3
## # Groups:   UseLevel [?]
##    UseLevel alcohol     n
##    <fct>      <dbl> <int>
##  1 low            0    24
##  2 low            1    18
##  3 low            2    22
##  4 low            3    73
##  5 low            4   116
##  6 low            5   365
##  7 low            6   231
##  8 high           0    10
##  9 high           1    16
## 10 high           2    46
## 11 high           3   125
## 12 high           4   171
## 13 high           5   394
## 14 high           6   274

We can see that for high and low use level the counts are similarly distributed, suggesting that alcohol use follows much the same distribution regardless of drug use.

par(mfrow=c(1,2))#create space in plotting device to place the histograms next to each other
hist(druguse$alcohol[druguse$UseLevel=='high'],main = "Alcohol given high use level", xlab ="Alcohol", breaks=6)#plot histogram
hist(druguse$alcohol[druguse$UseLevel=='low'],main = "Alcohol given low use level", xlab ="Alcohol", breaks=6)#plot histogram

Plotting the histograms, we can see the distributions are very similar in both cases, suggesting alcohol is not a good predictor. This is why it has a correlation near zero and a high p-value. The same is true of country and extraversion. In the case of country, however, it may be because most of the data comes from the UK or US and there is not much data for other countries.

druguse %>% group_by(UseLevel, country) %>% tally()#group by uselevel and country in the dataframe and tally cases
## # A tibble: 14 x 3
## # Groups:   UseLevel [?]
##    UseLevel country               n
##    <fct>    <fct>             <int>
##  1 low      Australia             8
##  2 low      Canada               30
##  3 low      NewZealand            1
##  4 low      Other                27
##  5 low      RepublicofIreland     5
##  6 low      UK                  716
##  7 low      USA                  62
##  8 high     Australia            46
##  9 high     Canada               57
## 10 high     NewZealand            4
## 11 high     Other                91
## 12 high     RepublicofIreland    15
## 13 high     UK                  328
## 14 high     USA                 495
par(mfrow=c(1,2))#create space in plotting device to place the histograms next to each other
hist(druguse$extraversion[druguse$UseLevel=='high'],main = "Extraversion given high use level", xlab ="Extraversion", breaks=6)#plot histogram
hist(druguse$extraversion[druguse$UseLevel=='low'],main = "Extraversion given low use level", xlab ="Extraversion", breaks=6)#plot histogram

Severity and use level are highly correlated, so the same predictors are also not significant, for the same reasons. However, in the case of any drug use, the non-significant predictors are different.

pvalues_cor_any <- melted_cor_pvalues_ord[melted_cor_pvalues_ord$Var1=="any",]#find pvalues relating to any
head(pvalues_cor_any)#show head
##     Var1         Var2        value
## 130  any      country 5.098144e-01
## 460  any    chocolate 1.606900e-01
## 229  any extraversion 3.176931e-02
## 427  any     caffeine 1.643696e-02
## 526  any      alcohol 5.556716e-05
## 196  any  neuroticism 1.407483e-06

We can see country and chocolate are not significant at the 5% significance level for any.

druguse %>% group_by(any, chocolate) %>% tally()#group by any and chocolate and tally
## # A tibble: 12 x 3
## # Groups:   any [?]
##    any   chocolate     n
##    <fct>     <dbl> <int>
##  1 FALSE         0     5
##  2 FALSE         3     5
##  3 FALSE         4    21
##  4 FALSE         5    67
##  5 FALSE         6    97
##  6 TRUE          0    27
##  7 TRUE          1     3
##  8 TRUE          2    10
##  9 TRUE          3    49
## 10 TRUE          4   275
## 11 TRUE          5   616
## 12 TRUE          6   710
druguse %>% group_by(any, country) %>% tally()#group by any and country and tally
## # A tibble: 12 x 3
## # Groups:   any [?]
##    any   country               n
##    <fct> <fct>             <int>
##  1 FALSE Australia             1
##  2 FALSE Canada                3
##  3 FALSE Other                 5
##  4 FALSE UK                  181
##  5 FALSE USA                   5
##  6 TRUE  Australia            53
##  7 TRUE  Canada               84
##  8 TRUE  NewZealand            5
##  9 TRUE  Other               113
## 10 TRUE  RepublicofIreland    20
## 11 TRUE  UK                  863
## 12 TRUE  USA                 552

From this we can see that chocolate consumption is similarly distributed for true and false, as the histograms drawn below show, while for country there aren't many false cases for any country, as shown above. This is why they have correlations of around 0.

par(mfrow=c(1,2))#create space in plotting device to place the histograms next to each other
hist(druguse$chocolate[druguse$any==T],main = "Chocolate given they use drugs", xlab ="Chocolate", breaks=6)#plot histogram 
hist(druguse$chocolate[druguse$any==F],main = "Chocolate given they don't use drugs", xlab ="Chocolate", breaks=6)#plot histogram

In the summary of the logistic regression model in question 2, none of these predictors were significant, although country being UK was significant while other countries were not; the age groups other than 45-54 and 55-64 were not significant, and nor were ethnicity, neuroticism, agreeableness and caffeine. That model's summary suggests gender being male, country being UK, openness to experience, education, conscientiousness, sensation seeking and nicotine are significant. These results make sense: alcohol, for example, is likely not significant because the data set is predominantly from the UK and US, where drinking alcohol is very common regardless of a person's traits, making it a poor predictor.